How to detect UTF-8-based encoded strings [closed]

Posted by Diego Sendra on Programmers See other posts from Programmers or by Diego Sendra
Published on 2013-06-24T19:59:04Z Indexed on 2013/06/24 22:30 UTC
Read the original article Hit count: 400

A customer of asked us to build him a multi-language based support VB6 scraper, for which we had the need to detect UTF-8 based encoded strings to decode it later for proper displaying in application UI. It's necessary to point out that this need arises based on VB6 limitations to natively support UTF-8 in its controls, contrary to what it happens in .NET where you can tell a control that it should expect UTF-8 encoding. VB6 natively supports ISO 8859-1 and/or Windows-1252 encodings only, for which textboxes, dropdowns, listview controls, others can't be defined to natively support/expect UTF-8 as you can do in .NET considering what we just explained; so we would see weird symbols such as é, è among others, making it a whole mess at the time of displaying.

So, next function contains whole UTF-8 encoded punctuation marks and symbols from languages like Spanish, Italian, German, Portuguese, French and others, based on an excellent UTF-8 based list we got from this link - Ref. http://home.telfort.nl/~t876506/utf8tbl.html

Basically, the function compares if each and one of the listed UTF-8 encoded sentences, separated by | (pipe) are found in our passed string making a substring search first. Whether it's not found, it makes an alternative ASCII value based search to get a match. Say, a string like "Societé" (Society in english) would return FALSE through calling isUTF8("Societé") while it would return TRUE when calling isUTF8("SocietÈ") since È is the UTF-8 encoded representation of é.

Once you got it TRUE or FALSE, you can decode the string through DecodeUTF8() function for properly displaying it, a function we found somewhere else time ago and also included in this post.

Function isUTF8(ByVal ptstr As String)
    Dim tUTFencoded As String
    Dim tUTFencodedaux
    Dim tUTFencodedASCII As String
    Dim ptstrASCII As String
    Dim iaux, iaux2 As Integer
    Dim ffound As Boolean

    ffound = False
    ptstrASCII = ""

    For iaux = 1 To Len(ptstr)
        ptstrASCII = ptstrASCII & Asc(Mid(ptstr, iaux, 1)) & "|"
    Next

    tUTFencoded = "Ä|Ã…|Ç|É|Ñ|Ö|ÃŒ|á|Ã|â|ä|ã|Ã¥|ç|é|è|ê|ë|í|ì|î|ï|ñ|ó|ò|ô|ö|õ|ú|ù|û|ü|â€|°|¢|£|§|•|¶|ß|®|©|â„¢|´|¨|â‰|Æ|Ø|∞|±|≤|≥|Â¥|µ|∂|∑|âˆ|Ï€|∫|ª|º|Ω|æ|ø|¿|¡|¬|√|Æ’|≈|∆|«|»|…|Â|À|Ã|Õ|Å’|Å“|–|—|“|â€|‘|’|÷|â—Š|ÿ|Ÿ|â„|€|‹|›|ï¬|fl|‡|·|‚|„|‰|Â|Ú|Ã|Ë|È|Ã|ÃŽ|Ã|ÃŒ|Ó|Ô||Ã’|Ú|Û|Ù|ı|ˆ|Ëœ|¯|˘|Ë™|Ëš|¸|Ë|Ë›|ˇ" & _
                "Å|Å¡|¦|²|³|¹|¼|½|¾|Ã|×|Ã|Þ|ð|ý|þ" & _
                "â‰|∞|≤|≥|∂|∑|âˆ|Ï€|∫|Ω|√|≈|∆|â—Š|â„|ï¬|fl||ı|˘|Ë™|Ëš|Ë|Ë›|ˇ"

    tUTFencodedaux = Split(tUTFencoded, "|")
    If UBound(tUTFencodedaux) > 0 Then
        iaux = 0
        Do While Not ffound And Not iaux > UBound(tUTFencodedaux)
            If InStr(1, ptstr, tUTFencodedaux(iaux), vbTextCompare) > 0 Then
                ffound = True
            End If

            If Not ffound Then
                'ASCII numeric search
                tUTFencodedASCII = ""
                For iaux2 = 1 To Len(tUTFencodedaux(iaux))
                    'gets ASCII numeric sequence
                    tUTFencodedASCII = tUTFencodedASCII & Asc(Mid(tUTFencodedaux(iaux), iaux2, 1)) & "|"
                Next
                'tUTFencodedASCII = Left(tUTFencodedASCII, Len(tUTFencodedASCII) - 1)

                'compares numeric sequences
                If InStr(1, ptstrASCII, tUTFencodedASCII) > 0 Then
                    ffound = True
                End If
            End If

            iaux = iaux + 1
        Loop
    End If

    isUTF8 = ffound
End Function

Function DecodeUTF8(s)
  Dim i
  Dim c
  Dim n

  s = s & " "

  i = 1
  Do While i <= Len(s)
    c = Asc(Mid(s, i, 1))
    If c And &H80 Then
      n = 1
      Do While i + n < Len(s)
        If (Asc(Mid(s, i + n, 1)) And &HC0) <> &H80 Then
          Exit Do
        End If
        n = n + 1
      Loop
      If n = 2 And ((c And &HE0) = &HC0) Then
        c = Asc(Mid(s, i + 1, 1)) + &H40 * (c And &H1)
      Else
        c = 191
      End If
      s = Left(s, i - 1) + Chr(c) + Mid(s, i + n)
    End If
    i = i + 1
  Loop
  DecodeUTF8 = s
End Function

© Programmers or respective owner

Related posts about character-encoding

Related posts about visual-basic-6